ABSTRACT

This paper presents an approach to spelling correction in agglutinative languages that 
is based on two-level morphology and a dynamic programming based search algorithm. 
Spelling correction in agglutinative languages is significantly different from that in languages
like English. The concept of a word in such languages is much wider than the entries
found in a dictionary, owing to productive word formation by derivational and inflectional 
affixations. After an overview of certain issues and relevant mathematical preliminaries, 
we formally present the problem and our solution. We then present results from our 
experiments with spelling correction in Turkish, a Ural-Altaic agglutinative language. 
Our results indicate that we can find the intended correct word in 95% of the cases and 
offer it as the first candidate in 74% of the cases, when the edit distance is 1. 

1 Introduction 

Spelling correction is an important component of any system for processing text. Creation 
of textual information is prone to many errors introduced by typing (human) or recognition 
(OCR systems) mistakes. Agglutinative languages such as Turkish or Finnish differ
from languages like English in the way lexical forms are generated. Words are formed
by productive affixations of derivational and inflectional suffixes to roots or stems, like
"beads on a string" [14]. Furthermore, roots and suffixes (morphemes) may undergo
changes at the boundaries due to various phonetic interactions. A typical nominal or a 
verbal root may have thousands (or even millions) of valid forms which never appear in 
the dictionary. For instance, we can give the following (rather exaggerated) example from 
Turkish: 

uygarlaştıramayabileceklerimizdenmişsinizcesine

whose root is the adjective uygar (civilized). 1 The morpheme breakdown (with morpho- 
logical glosses underneath) is: 2 

1 This is an adverb meaning roughly "(behaving) as if you were one of those whom we might not be 
able to civilize." 

2 Glosses in parentheses indicate derivations not explicitly indicated by a morpheme. 
uygar      +laş   +tır   +ama  +yabil  +ecek
civilized  +AtoV  +CAUS  +NEG  +POT    +VtoA(AtoN)

+ler  +imiz      +den         +miş   +siniz  +cesine
+3PL  +POSS-1PL  +ABL(+NtoV)  +PAST  +2PL    +VtoAdv

The portion of the word following the root consists of 11 morphemes each of which either 
adds further syntactic or semantic information to, or changes the part-of-speech of, the 
part preceding it. Though most words one uses in Turkish are considerably shorter than 
this, this example serves to point out the fundamental difference of the spelling checking 
and correction problem in such languages. Methods developed for spelling correction 
for languages like English (see the review by Kukich [10]) are not readily applicable to 
agglutinative languages. 

Our prior work has mainly been on spelling checking in Turkish [12, 13], and two-level 
morphological analysis of Turkish [11]. In this work, we develop an algorithm for spelling 
correction for agglutinative languages that we have applied to Turkish. Our approach uses 
a two-level morphological analyzer and generator, 3 coupled with a dynamic-programming 
like search procedure for intelligently enumerating candidate lexical forms from a given 
misspelled form. In the following sections, we overview the spelling correction problem in 
general and in agglutinative languages, present some preliminary definitions and mathe- 
matical background and introduce an algorithm for spelling correction for agglutinative 
languages, and finally present results from our implementation for Turkish. 

2 The spelling correction problem 

Du and Chang [3] define the spelling correction problem as follows: 

From a set of known words (dictionary), find those words that most resemble 
a given (misspelled) character string. 

The keyword in this definition is "resemble." It is difficult to express rigorously how
two strings resemble each other. Generally, a distance metric is used to compare two strings. The
problem then becomes that of finding those words that are neighbors of a given character
string with respect to a given distance metric. A number of distance metrics have been
proposed for comparing two strings [8, 10, 15]. The most popular
and widely used metrics are q-gram and linear trace based metrics. In the q-gram metric,
two strings are compared according to the number of different substrings of length q they 
share. In the linear trace method, two strings are compared according to an edit distance 
metric which measures the extent of changes one needs to apply to one of the strings to 
get the other string. 

3 We should however emphasize that nothing in our approach depends specifically on two-level
morphology per se.
3 Spelling correction in agglutinative languages



As briefly discussed earlier, agglutinative languages have certain aspects that make the
spelling correction problem substantially harder than, and different from, that for languages like
English. The expression "from a set of known words" no longer implies what is usually
found in a typical word list, and now means "all possible words that can be generated
from a given root word by derivational and inflectional suffixes." For example, Finnish
nouns have about 2000 distinct forms while Finnish verbs have about 12,000 forms ([4],
pp. 59-60). The case in Turkish is similar: nouns may have about 170 basic
different forms, not counting the forms for adverbs, verbs, adjectives, or other nominal
forms generated (sometimes circularly) by derivational suffixes. (Hankamer [5] gives much
higher figures, in the millions, for Turkish.) If we look closely into the problem, it is
not difficult to observe that it consists of two subproblems. 4 Given a misspelled word:

1. determine all the roots from the dictionary that can be the root of the misspelled 
word, and 

2. generate (systematically) all the possible words that "resemble" the given character 
string, from roots identified in subproblem 1. 

The first step of the problem is relatively easy because of the static structure of the 
root dictionary. Various techniques developed for spelling correction, say, in English can 
usually be applied here. We will opt not to deal with cases where a root cannot be
determined, especially due to total or near-total deformation.

The second step is the heart of the problem. Producing all the possible words from 
all the known roots requires an exhaustive generate and test search procedure. 

Our approach differs from that of Aduriz et al. [1], which also uses a morphological
analysis approach. Their approach differs significantly from ours in that they
rely mainly on redundant two-level rules to do correction, while our approach is based on
exploiting the morphotactics information.

3.1 Notation 

We denote the set of the surface forms of the roots in the language 5 by $R$, and the set
of lexical forms of the roots by $R_{lex}$. 6 We use $X = x_1, x_2, \ldots, x_m$ and $Y = y_1, y_2, \ldots, y_n$ to
denote strings from the alphabet of the language. X will denote the surface form of
the incorrect or misspelled string, and Y will typically denote the surface string that is
a (possibly partial) candidate word. $Y_{lex}$ will denote the lexical form of this candidate
4 In this paper, we do not deal with languages that have productive prefixes. 
5 From now on, language will refer to an agglutinative language. 

6 Here, we are referring to the two levels of forms in the two-level morphology terminology: the lexical 
form which essentially corresponds to the structure of a word in terms of morphemes etc., and the surface 
form which is the surface realization of the lexical form as allowed by the automata implementing the 
two-level phonetic correspondence rules [14, 2, 9, 6]. 
string. 7 The notation $X[i:j] = x_i, x_{i+1}, \ldots, x_j$ refers to the substring (from characters
i to j inclusive) of any string X. If i is missing, then the substring refers to the prefix
of the string up to and including the j-th character. X[0] denotes the empty substring and
$|X|$ denotes the length of string X. We assume the existence of a function, surface(),
to generate surface strings from lexical strings, i.e., $surface(Y_{lex}) = Y$. The function
surface() applies the constraints imposed by the automata implementing the two-level
morphophonemic rules for the language.

3.2 Distance metrics 

In both parts of the problem, we need some criteria to measure how much two strings
resemble each other. The two most widely accepted and readily applicable metrics are the
q-gram distance metric and the minimum edit distance metric.

3.2.1 Q-gram distance 

A q-gram is a substring of length q. The q-gram distance between two strings is the
number of q-grams they do not have in common. For example, denoting the q-gram
distance between two strings X and Y as $D_q(X, Y)$, $D_2(ahmet, mehmet) = 3$ (2-grams
(bi-grams) not common to both = {ah, me, eh}), and $D_3(ahmet, mehmet) = 3$ (3-grams
(tri-grams) not common to both = {ahm, meh, ehm}).
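The example above can be reproduced with a few lines of code. The following is a minimal Python sketch, not the implementation used in this work; the function names are hypothetical, and it assumes that q-grams are counted with multiplicity, which is what the worked example implies (the second occurrence of "me" in mehmet counts as unmatched).

```python
from collections import Counter


def qgrams(s: str, q: int) -> Counter:
    """Return the multiset of all substrings of length q in s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(x: str, y: str, q: int) -> int:
    """Count the q-gram occurrences that x and y do not have in common."""
    gx, gy = qgrams(x, q), qgrams(y, q)
    # A Counter returns 0 for a q-gram that is absent from one of the strings.
    return sum(abs(gx[g] - gy[g]) for g in gx.keys() | gy.keys())
```

Counting occurrences rather than distinct q-grams is a design choice of this sketch; a set-based count would give 2 instead of 3 for the bi-gram example.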

3.2.2 Edit distance 

The edit distance measures how many unit operations are necessary to convert one string
into another. The unit operations are insertion, deletion, replacement of a single character,
and transposition of two adjacent characters.

Definition 1 (Edit Distance) 8 

7 Just to make this clear we can give an example from Turkish. For instance 

ev+lAr+nHn (house+PLU+GEN) 

represents such a lexical form where A represents a low unrounded vowel (a and e in Turkish) which is
unresolved for frontness, and H represents a high vowel (ı, i, u, and ü) which is unresolved for other features.
The +'s indicate the morpheme boundaries. When this lexical form is processed by the generation 
component of a two-level morphological analyzer, the surface form obtained is: 

evlerin 

where vowel harmony rules have resolved the A and the H, and the first n in the last morpheme has 
disappeared since the previous morpheme ends with a consonant. See Oflazer [11]. 

8 This is a slight modification of edit distance formulas given by Du and Chang [3] and by Wagner and 
Fischer, [15]. 
Given two strings X and Y of length m and n respectively, ed(X[m], Y[n]) 9 computed
according to the recurrence below gives the minimum number of insertions, deletions,
replacements and transpositions one needs to perform to convert one string to the other.

$$
ed(X[i+1], Y[j+1]) =
\begin{cases}
ed(X[i], Y[j]) & \text{if } x_{i+1} = y_{j+1} \\
1 + \min\{ed(X[i-1], Y[j-1]),\ ed(X[i+1], Y[j]),\ ed(X[i], Y[j+1])\} & \text{if both } x_i = y_{j+1} \text{ and } x_{i+1} = y_j \\
1 + \min\{ed(X[i], Y[j]),\ ed(X[i+1], Y[j]),\ ed(X[i], Y[j+1])\} & \text{otherwise}
\end{cases}
$$

$$
ed(X[0], Y[j]) = j, \quad 0 \le j \le n
$$

$$
ed(X[i], Y[0]) = i, \quad 0 \le i \le m
$$
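As an illustration, the recurrence of Definition 1 can be evaluated bottom-up over a table of prefix distances. The following is a minimal Python sketch, not the implementation used in this work; the function name `edit_distance` and the table layout are assumptions made for the example.

```python
def edit_distance(x: str, y: str) -> int:
    """Edit distance with insertion, deletion, replacement and adjacent transposition."""
    m, n = len(x), len(y)
    # h[i][j] holds ed(X[i], Y[j]) for the prefixes of length i and j.
    h = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        h[i][0] = i          # boundary: delete all i characters
    for j in range(n + 1):
        h[0][j] = j          # boundary: insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                # Last characters agree: no extra cost.
                h[i][j] = h[i - 1][j - 1]
            elif i > 1 and j > 1 and x[i - 2] == y[j - 1] and x[i - 1] == y[j - 2]:
                # Last two characters are swapped: transposition case.
                h[i][j] = 1 + min(h[i - 2][j - 2], h[i][j - 1], h[i - 1][j])
            else:
                # Replacement, insertion or deletion.
                h[i][j] = 1 + min(h[i - 1][j - 1], h[i][j - 1], h[i - 1][j])
    return h[m][n]
```

Keeping the whole table rather than two rows costs more memory, but the same table is what the error matrix H used later in the search refers to.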



3.3 Recognizing and generating strings in the language 

We would like to capture and abstract the behavior of a morphological generator and 
analyzer for the given language by two finite state automata. 

Definition 2 A finite state generator $M_g = (P, \delta, V, S, F)$, where P is a set of states, V
is the output alphabet (of lexical morphemes), $\delta$ is the state transition relation consisting
of a set of triples $(p_i, p_j, v_k)$ indicating that the machine may traverse from state $p_i$ to
state $p_j$ and output (the morpheme) $v_k$ (hence we label transition edges by v's), S is
the starting state, and F is a set of final states, generates all correctly formed words of
the language. It should be noted that it is possible to go from one state $p_i$ to another $p_j$
by more than one transition, each outputting a different morpheme. We say a string $Y_{lex}$ is
generated by $M_g$ if $Y_{lex}$ is formed by concatenating, in order, the outputs of the machine
as we traverse from S to one of the states in F. We denote by $L(M_g)$ the set
of all lexical strings generated by $M_g$.

$M_g$ essentially captures the morphotactics of the language, and in general may contain
circular transition sequences (as is the case in Turkish). Applying the function surface()
to a string generated by $M_g$ will give us a valid surface string in the language. We also
have a finite state recognizer $M_r$ which recognizes whether given surface strings are in the
language or not. When a surface string is input to $M_r$ and $M_r$ reaches one of its
final states, the input is a legal word in the language; hence $M_r$ implements a
spelling checking functionality for the language. Figure 1 depicts the finite state generator 
defined above, where the lexical forms of the morphemes label the edges between states, 
and states with double circles are the final states. Typically these will be very large finite 
state machines with hundreds to thousands of states. 

9 We may occasionally drop the index of one or both arguments to indicate that we are referring to 
the whole string. 
Figure 1: The finite state generator embodying the morphotactics

3.4 Formal description of the spelling correction problem 

We can now define the spelling correction problem as: 

Definition 3 Given an incorrect word X (rejected by $M_r$), and an edit distance threshold
t, find the solution set of possible correct words $S(X, t) = \{Y \mid ed(X, Y) \le t \text{ and } Y = surface(Y_{lex}) \text{ and } Y_{lex} \in L(M_g)\}$.

In the context of the morphotactics graph shown in Figure 1, the problem can also be
stated as "finding all paths from the start state (node) to all final states (nodes) such
that the edit distance between the given misspelled string and the string generated by
applying the surface() function to the concatenation of the labels of the arcs along such
a path is less than or equal to a given threshold." This is depicted in Figure 2. Obviously the
search for such paths has to be fast.

We will now consider two subproblems of the problem. 

3.5 Determining the root 

Presenting alternatives for a given incorrect string X requires determination of all possible
roots. The criteria used to select roots are based on the edit distance between the
surface form of a root and the prefixes of X. If any root word has an edit distance
from some prefix of the misspelled word that does not exceed the threshold t, then it is a
candidate root. An example from Turkish makes this clear. For the misspelled Turkish word
X = kalayhlamak, kalayla and kalas (among others) are possible roots when t = 1
because ed(kalayhla, kalayla) = ed(kalay, kalas) = 1. However, yatay is not a possible root
Figure 2: A path denoting a possible word in the language. The shaded area symbolizes
the section of the graph to be searched. 

since ed(kala, yatay) = 3 > 1, ed(kalay, yatay) = 2 > 1, and ed(kalayh, yatay) = 3 > 1.
This observation leads to the following definition:

Definition 4 The set of all the possible roots for the incorrect word X is $PR(X, t) = \{r \mid ed(X[i], r) \le t \text{ and } 1 \le i \le m \text{ and } r \in R\}$.
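Definition 4 can be read directly as a filter over the root list. The sketch below, in Python, is a brute-force illustration under stated assumptions rather than the procedure used in this work: the three-word root list is a toy stand-in for the root lexicon, and `edit_distance` and `possible_roots` are hypothetical names.

```python
def edit_distance(x: str, y: str) -> int:
    """Edit distance of Definition 1 (with adjacent transposition)."""
    m, n = len(x), len(y)
    h = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        h[i][0] = i
    for j in range(n + 1):
        h[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                h[i][j] = h[i - 1][j - 1]
            else:
                swap = (i > 1 and j > 1
                        and x[i - 2] == y[j - 1] and x[i - 1] == y[j - 2])
                diag = h[i - 2][j - 2] if swap else h[i - 1][j - 1]
                h[i][j] = 1 + min(diag, h[i][j - 1], h[i - 1][j])
    return h[m][n]


def possible_roots(x: str, roots, t: int) -> set:
    """PR(X, t): roots within distance t of some non-empty prefix of x."""
    return {r for r in roots
            if any(edit_distance(x[:i], r) <= t for i in range(1, len(x) + 1))}


# Toy root list (an assumption for illustration only).
TOY_ROOTS = ["kalayla", "kalas", "yatay"]
```

Calling `possible_roots("kalayhlamak", TOY_ROOTS, 1)` keeps kalayla and kalas and rejects yatay, as in the example above. Scanning every root this way is too slow for a lexicon of realistic size, which motivates a pre-constructed index.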

In general, the cardinality of R, the set of all roots, is in the tens of thousands,
thus one needs a fast search algorithm that works on a pre-constructed data structure for
efficient determination of PR(X, t). We have chosen to represent the q-gram information
associated with root words with an inverted bit vector structure so that the bit vector
corresponding to a q-gram has 1's at positions corresponding to the root words containing
that q-gram. Since the root list is static, 10 such a structure can be constructed off-line,
and can be accessed randomly by using the q-gram as a key. Let us denote by k the
number of q-grams in a root that we would like to consult, and by $t_q$ the number of
q-grams we are willing to leave out and yet call the root a possible candidate root. To
generate the set of such roots, we take the first k q-grams of the incorrect word and then
consider all $\binom{k}{k - t_q}$ subsets of $(k - t_q)$ q-grams. For each such subset, we intersect
the bit vectors corresponding to the q-grams in that subset. We then union the bit vectors
resulting from each subset. The resulting bit vector then has 1's corresponding to root
words which are "close" to a prefix of the misspelled word X. These roots are then filtered
by the edit distance constraint in Definition 4 to compute PR(X, t). The parameters k
and $t_q$ are in general fixed once according to the average length of the root words in the

10 We can always deal with newly added root words in a similar fashion using different set of such bit 
vectors. 
language.
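The inverted bit vector lookup described above can be sketched as follows. This is a minimal Python illustration, not the data structure used in this work: Python integers stand in for bit vectors, the names `build_index` and `candidate_roots` are hypothetical, and the subsequent edit distance filter of Definition 4 is left out.

```python
from itertools import combinations


def build_index(roots, q=2):
    """Map each q-gram to an integer bit vector; bit i is set if roots[i] contains it."""
    index = {}
    for i, root in enumerate(roots):
        for j in range(len(root) - q + 1):
            g = root[j:j + q]
            index[g] = index.get(g, 0) | (1 << i)
    return index


def candidate_roots(x, roots, index, q=2, k=3, t_q=2):
    """Roots sharing at least k - t_q of the first k q-grams of x."""
    first = [x[j:j + q] for j in range(min(k, len(x) - q + 1))]
    keep = max(len(first) - t_q, 1)       # size of each q-gram subset
    all_ones = (1 << len(roots)) - 1
    result = 0
    for subset in combinations(first, keep):
        bits = all_ones
        for g in subset:
            bits &= index.get(g, 0)       # intersect within a subset
        result |= bits                    # union across subsets
    return [r for i, r in enumerate(roots) if (result >> i) & 1]
```

With bi-grams, k = 3 and $t_q$ = 2, each subset holds a single bi-gram, so a root qualifies as soon as it contains any one of the first three bi-grams of the misspelled word. The index is built once, off-line, and each lookup then costs only a handful of bitwise operations.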



3.6 Generating candidate words from a given root 

Assuming that we have a set of root words found as described above, we now have to 
generate words in the language having this root, that do not deviate from the given 
misspelled string by more than the threshold. 

We will first consider solutions where the root portion of the word may be misspelled
and the rest may be okay. We say that such solutions are on the left edge of the word.

3.6.1 Getting solutions on the left edge 

The edit distances between an element r of PR(X, t) and certain prefixes of X are between
0 and t. Sometimes, these distances are equal to t, which means no further mismatches
between X and Y, the candidate string, are to be tolerated. In such cases, there is
no need for further checking by generating a morpheme sequence. Just concatenating
the portion of X that remains after aligning r with a prefix of X to $r_{lex}$, and then
generating the surface string, will give us candidate Y strings. However, determination of
the alignment of the root word r with X is somewhat tricky because the root in X may
be deformed.

Let us now define a new edit distance measure between r, an element of PR(X, t), and
X. This is the minimum of the edit distances between r and any prefix of X.

Definition 5 The prefix edit distance between r and X is $pred(X, r) = \min\{ed(X[i], r) \mid 1 \le i \le m\}$.

Definition 6 The set of alignment indexes of r in X is $index(X, r) = \{i \mid ed(X[i], r) = pred(X, r)\}$.

For the example given before,

pred(kalayhlamak, kalayla) = 1 and index(kalayhlamak, kalayla) = {8},
and pred(kalayhlamak, kalas) = 1 and index(kalayhlamak, kalas) = {4, 5}.
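Both quantities can be read off a single column of the error matrix between X and r. The Python sketch below is an illustration only; `error_matrix` and `pred_and_index` are hypothetical names, and X is assumed to be non-empty.

```python
def error_matrix(x: str, y: str):
    """h[i][j] = ed(X[i], Y[j]), following the recurrence of Definition 1."""
    m, n = len(x), len(y)
    h = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        h[i][0] = i
    for j in range(n + 1):
        h[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                h[i][j] = h[i - 1][j - 1]
            else:
                swap = (i > 1 and j > 1
                        and x[i - 2] == y[j - 1] and x[i - 1] == y[j - 2])
                diag = h[i - 2][j - 2] if swap else h[i - 1][j - 1]
                h[i][j] = 1 + min(diag, h[i][j - 1], h[i - 1][j])
    return h


def pred_and_index(x: str, r: str):
    """Return (pred(X, r), index(X, r)) with 1-based prefix lengths."""
    h = error_matrix(x, r)
    # The last column holds ed(X[i], r) for every prefix length i.
    dists = {i: h[i][len(r)] for i in range(1, len(x) + 1)}
    best = min(dists.values())
    return best, {i for i, d in dists.items() if d == best}
```

Computing the matrix once and scanning its last column avoids a separate edit distance computation for each prefix of X.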

When pred(X, r) = t, the remaining part of X after alignment with the root r must
occur completely in Y after r to satisfy $ed(X, Y) \le t$. That is, Y must be of the form
$Y = surface(concatenate(r_{lex}, X[i+1 : m]))$, $i \in index(X, r)$. For the example above,
the candidate from root kalayla is kalaylamak, which happens to be the correct solution
and hence is accepted by $M_r$. The candidates due to kalas are kalashlamak and
kalasyhlamak, both of which are rejected by $M_r$. Constructing Y's for all the elements
of the index set and all elements of the candidate root set gives all possible solutions on
the left edge.
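The left-edge construction can be sketched in a few lines. The Python below is a hedged illustration, not the implementation used in this work: the root is treated as its own lexical form, `surface` defaults to the identity, and `accepts` is a hypothetical stand-in for the recognizer $M_r$ that accepts everything unless the caller supplies a real check.

```python
def error_matrix(x: str, y: str):
    """h[i][j] = ed(X[i], Y[j]), following the recurrence of Definition 1."""
    m, n = len(x), len(y)
    h = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        h[i][0] = i
    for j in range(n + 1):
        h[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                h[i][j] = h[i - 1][j - 1]
            else:
                swap = (i > 1 and j > 1
                        and x[i - 2] == y[j - 1] and x[i - 1] == y[j - 2])
                diag = h[i - 2][j - 2] if swap else h[i - 1][j - 1]
                h[i][j] = 1 + min(diag, h[i][j - 1], h[i - 1][j])
    return h


def left_edge_candidates(x, r, t, surface=lambda s: s, accepts=lambda w: True):
    """Candidates r + X[i+1:m] for each alignment index i, when pred(X, r) = t."""
    h = error_matrix(x, r)
    dists = {i: h[i][len(r)] for i in range(1, len(x) + 1)}
    pred = min(dists.values())
    if pred != t:
        return []                     # mismatches may remain; search is needed
    index = sorted(i for i, d in dists.items() if d == pred)
    # x[i:] in 0-based slicing is X[i+1 : m] in the 1-based notation of the text.
    candidates = [surface(r + x[i:]) for i in index]
    return [y for y in candidates if accepts(y)]
```

For X = kalayhlamak, r = kalayla and t = 1 this yields kalaylamak; a real recognizer passed as `accepts` would then discard the candidates built from kalas.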
3.6.2 Generating candidate words



Getting solutions on the left edge will ease the computation of the correct word if the erroneous
part happens to be in the root, but it does not solve the problem completely. The solution
requires a generate-and-test probing of the graph of the finite-state generator $M_g$, starting with
the start state S. We now have to find all the paths from this state to one of the final states
using the roots in PR(X, t), so that when the morphemes along this path are concatenated
and the surface string is generated, it is within an edit distance t of X.

When the search starts, morphemes are concatenated and the length of the candidate
lexical string $Y_{lex}$ increases. After each step of the search, the partial surface string Y is
compared with a suitable prefix of X. In most of the cases the candidate Y will deviate
from these prefixes of X by more than the threshold without reaching a final state, so that 
it can no longer lead to a viable solution. In such cases we do not consider any further 
transitions from that state. 

The following theorem from Du and Chang [3] helps us to determine when a partial 
candidate Y will not yield any result. 

Theorem 1 The error matrix for all prefixes of X and Y is defined as $H_{m \times n}$ where
$H(i, j) = ed(X[i], Y[j])$. Assume that $m > n$ and let $d = m - n$. Then the sequence of
elements of H along the path $H(1,1) - H(2,1) - H(3,1) - \ldots - H(d+1, 1) - H(d+2, 2) - \ldots - H(m, n)$
is non-decreasing.

Proof: See Du and Chang [3]. 

Theorem 1 determines a non-decreasing path in the error distance matrix $H_{m \times n}$. This is
not exactly what we need since the theorem requires that the length of the candidate
string Y be known. In our case, we know only that this length has to be in the range m - t to
m + t for Y to be a candidate.

3.6.3 Limiting search during word generation 

Due to the limitation above, we cannot cut a branch of the search by looking at only
a single path in $H_{m \times n}$ as defined in Theorem 1. First we construct H for the current
(possibly partial) Y, then consider column n (n being the current length of Y), and then
find the minimum of the edit distance values along this column between rows n - t and
n + t inclusive. If this value exceeds the threshold t, then there is no point in further
pursuing this path, i.e., this Y will not lead to any solution. Formally, we define a cut-off 
distance metric: 

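Definition 7 (Cut-off distance)

$$
cuted(X[m], Y[n]) =
\begin{cases}
\min\{H[i, n] \mid 1 \le i \le n + t\} & \text{if } n \le t \\
\min\{H[i, n] \mid n - t \le i \le n + t\} & \text{if } t < n \le m \\
\min\{H[i, n] \mid n - t \le i \le m\} & \text{if } m < n \le m + t \\
n - m & \text{if } m + t < n
\end{cases}
$$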



Figure 3: Determination of Cut-Off Paths in $H_{m \times n}$ (axis label: Candidate String; annotation: Values at possible Cut-Off Paths)

The idea is similar to pred(X, r) defined earlier, in that prefixes of X are again considered.
If the cut-off edit distance between X and the current Y does not exceed the
threshold, further transitions from the state in $M_g$ currently reached by $Y_{lex}$ have
to be pursued.
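The cut-off distance amounts to a windowed minimum over the last column of the error matrix. The Python sketch below is an illustration under stated assumptions, not the implementation used in this work: X and Y are assumed non-empty, the function names are hypothetical, and the row window is clipped to the rows that actually exist (1 to m), which is an assumption of this sketch for the case where n + t exceeds m.

```python
def error_matrix(x: str, y: str):
    """h[i][j] = ed(X[i], Y[j]), following the recurrence of Definition 1."""
    m, n = len(x), len(y)
    h = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        h[i][0] = i
    for j in range(n + 1):
        h[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                h[i][j] = h[i - 1][j - 1]
            else:
                swap = (i > 1 and j > 1
                        and x[i - 2] == y[j - 1] and x[i - 1] == y[j - 2])
                diag = h[i - 2][j - 2] if swap else h[i - 1][j - 1]
                h[i][j] = 1 + min(diag, h[i][j - 1], h[i - 1][j])
    return h


def cuted(x: str, y: str, t: int) -> int:
    """Cut-off distance between misspelled x and partial candidate y."""
    m, n = len(x), len(y)
    if n > m + t:
        return n - m                  # y is already too long to be a solution
    h = error_matrix(x, y)
    lo = max(1, n - t)                # first row of the window
    hi = min(m, n + t)                # last row, clipped to the matrix
    return min(h[i][n] for i in range(lo, hi + 1))
```

For X = kalayhlamak and t = 1, the partial candidate kalayla has cut-off distance 1 and is kept, while yatay has cut-off distance 2 and is abandoned.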

After these observations we can state our algorithm for word generation, by searching 
the morphotactic graph, as follows: 



    Compute PR(X, t)
    Initialize C(X, t) to the empty set
    for all r ∈ PR(X, t)
        /* push the root and the node to start the search from on to the stack */
        PUSH((r_lex, p_{r_lex}))
        while stack not empty
            POP((Y_lex, p_i))    /* pop the next state to check */
            for all p_j such that (p_i, p_j, v) ∈ δ
                Y = surface(concat(Y_lex, v))    /* n is the current length of Y */
                if cuted(X[m], Y[n]) ≤ t
                    PUSH((concat(Y_lex, v), p_j))
                if ed(X[m], Y[n]) ≤ t and p_j ∈ F
                    then insert Y into C(X, t)
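To make the control flow concrete, the search can be sketched in Python over a toy generator. Everything specific here is an assumption for illustration: the transition table `TOY_DELTA`, the state names, the root-to-state pairs, and a `surface` function that merely strips morpheme boundaries stand in for a real two-level generator, and the row window in `cuted` is clipped to the matrix. The sketch computes the surface form after each morpheme is appended.

```python
def error_matrix(x, y):
    """h[i][j] = ed(X[i], Y[j]), following the recurrence of Definition 1."""
    m, n = len(x), len(y)
    h = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        h[i][0] = i
    for j in range(n + 1):
        h[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                h[i][j] = h[i - 1][j - 1]
            else:
                swap = (i > 1 and j > 1
                        and x[i - 2] == y[j - 1] and x[i - 1] == y[j - 2])
                diag = h[i - 2][j - 2] if swap else h[i - 1][j - 1]
                h[i][j] = 1 + min(diag, h[i][j - 1], h[i - 1][j])
    return h


def cuted(h, m, n, t):
    """Cut-off distance read from column n of a precomputed error matrix."""
    if n > m + t:
        return n - m
    return min(h[i][n] for i in range(max(1, n - t), min(m, n + t) + 1))


def surface(y_lex):
    """Toy stand-in for the generator: drop the morpheme boundary marks."""
    return y_lex.replace("+", "")


def generate_candidates(x, roots, delta, finals, t):
    """Depth-first search of the morphotactics graph with cut-off pruning."""
    found = set()
    for r_lex, start in roots:
        stack = [(r_lex, start)]
        while stack:
            y_lex, p_i = stack.pop()
            for p_j, v in delta.get(p_i, []):
                y = surface(y_lex + v)
                h = error_matrix(x, y)
                if cuted(h, len(x), len(y), t) <= t:
                    stack.append((y_lex + v, p_j))     # still viable: extend later
                if h[len(x)][len(y)] <= t and p_j in finals:
                    found.add(y)                       # complete word within t
    return found


# Toy morphotactics (hypothetical): a verb root and a noun root.
TOY_DELTA = {"V": [("INF", "+mak"), ("NEG", "+ma")],
             "NEG": [("INF", "+mak")],
             "N": [("PL", "+lar")]}
TOY_FINALS = {"INF", "PL"}
TOY_ROOTS = [("kalayla", "V"), ("kalas", "N")]
```

On this toy graph, `generate_candidates("kalayhlamak", TOY_ROOTS, TOY_DELTA, TOY_FINALS, 1)` explores the branch through kalaylama, abandons kalaylamamak and kalaslar, and returns only kalaylamak. Recomputing the full error matrix at every step is wasteful; an incremental column update would be the natural refinement.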
Theorem 2 The algorithm above produces exactly the solution set C(X, t) when PR(X, t)
is given.

Proof: Every element of PR(X, t) is pushed into the stack. If $Y \in C(X, t)$ then $Y_{lex} \in L(M_g)$,
so there is a sequence of states in $M_g$, $S \to p_{i_1} \to p_{i_2} \to p_{i_3} \cdots p_{i_k}$, with
$Y = surface(concat(r_{lex}, v_{i_1}, v_{i_2}, \ldots, v_{i_k}))$, and for all
$Y_j = surface(concat(r_{lex}, v_{i_1}, \ldots, v_{i_j}))$, $1 \le j \le k$, we have $cuted(X, Y_j) \le t$.

3.7 Changes to the left of the morpheme boundary

During the affixation process, some characters to the left of the morpheme boundary
may be deleted or modified, though such modification will not be reflected in the partial
surface form until a subsequent morpheme is added. 11 For example in Turkish, the
lexical form gel+AcAk+Hm has the surface form geleceğim,
yet one cannot know, when the second morpheme is added to the first morpheme (the
root), that its final k will change to the glide ğ when a third morpheme is added. To handle
these cases, for morphemes ending in (possibly a sequence of) characters that may undergo
such changes, we can temporarily increase the threshold accordingly during edit distance
matches.

11 We assume that the changes induced on the surface form by a new morpheme affect a very small
postfix of the stem constructed so far.

4 Ranking the Candidate Solutions

An essential part of the spelling correction problem is the ranking of the candidate solutions.
Candidate solutions can be ordered by increasing edit distance to the misspelled
string. But when the number of solutions with the same edit distance is large, it is difficult
to choose some subset of meaningful solutions to present to the user. The problem is further
complicated by the fact that the correct solution is usually determined by syntactic and
semantic context and depends at least on the relative frequency of usage of the root
words.

We have opted to use a model of spelling errors based on certain statistics we have about
the types of spelling errors people have made in typed Turkish text. Our observation from
our sample of misspelled words is that 23.1% of misspelled strings contain replacement
errors, 22.2% contain a deletion, 17.3% contain an addition and 3.3% contain transposition
errors. However, the most dominant error type within replacements (with 34%) is the
replacement of ş-s, ç-c, ı-i, ö-o, ü-u, a-e pairs, all except the last one being the result of
typing Turkish using a non-Turkish keyboard lacking Turkish characters or composing
Turkish characters in complicated ways.

These results suggest the heuristics that we can use in ranking. First we give high
priority to solutions that can be converted to the misspelled string by replacement (especially
as above). Then we prefer longer solutions because deletion and replacement of
characters occur more frequently. Transpositions are of lower priority as the frequency
of this error is very low in the statistics.



5 Results from experiments with spelling correction 
in Turkish 

We first present a spelling correction example from our implementation where we used
bi-grams (q = 2), and we chose k as 3 and $t_q$ as 2.

EXAMPLE

Misspelled word: çaışmalarıyla
Threshold t: 2

Solutions on left edge: yazışmalarıyla yatışmalarıyla
                        yapışmalarıyla yakışmalarıyla
                        takışmalarıyla sayışmalarıyla
                        mayışmalarıyla katışmalarıyla
                        kapışmalarıyla kakışmalarıyla
                        kaşışmalarıyla çıkışmalarıyla

Candidate Roots: 12 çağ çakı çal çalı çam çan çap çar çat çatı çav çay
                    çağ çak çakış çal çalış çap çat çatış çav

Solutions: 13       Lexical                   Surface

Edit distance 1     çat+Hş+mA+lArH+ylA        çatışmalarıyla
                    çap+Hş+mA+lArH+ylA        çapışmalarıyla
                    çalış+mA+lArH+ylA         çalışmalarıyla (correct form)

Edit distance 2     çav+mA+lArH+ylA           çavmalarıyla
                    çav+Hş+mA+lAr+Hm+ylA      çavışmalarımla
                    çav+Hş+mA+lAr+Hn+ylA      çavışmalarınla
                    çav+Hş+mA+lArH+ysA        çavışmalarıysa
                    çat+Hl+mA+lArH+ylA        çatılmalarıyla
                    çat+mA+lArH+ylA           çatmalarıyla



The algorithm described above was tested on a set of 141 randomly selected incorrect 
words from Turkish text. Among these misspelled words, 14% had edit distance of 2, and 
the remaining 86% had edit distance 1, to their intended correct form. The morphological 
analyzer and generator that we used was our two-level specification for Turkish [11], de- 
veloped using the PC-KIMMO system. This system has a rather comprehensive coverage 

12 The duplicate entries in the list of candidate roots for the example, are in fact not duplicate; they 
have different part-of-speech categories and hence different morphotactics. 
13 A small subset of the whole solution set is given here. 
of Turkish morphology and uses a root lexicon of about 24,000 words. It is, however,
rather slow: it can analyze only about 2 forms per second and can generate about 50
forms per second on Sun Sparcstations. So, instead of using timings, we counted the number
of times the morphological analyzer and generator, and the edit distance computations,
were called, as these were the most expensive operations in our algorithm. 14

These statistics show the average number of morphological recognitions and gener- 
ations, and the edit distance operations required, and the number of correct solutions 
offered per misspelled input word. The last column indicates the percentage of cases the 
intended correct form was found. The results in Table 1 are for threshold t = 1 and 
the results in Table 2 are for threshold t = 2. In both cases, bi-grams were used with 
$t_q = 2$. We varied k (which determines how many bi-grams from the beginning of the
incorrect word are to be considered) between 3 and 5. This range was considered because,
according to some limited statistics we have on Turkish text, the average root length is
about 4.5 characters. Choosing k = 3 allows more deformed roots to be handled at the
expense of more computation, while choosing k = 5 runs faster but will sometimes miss roots with
minor deformations.



Table 1: Average number of operations per misspelled word, for t = 1

k   Recognitions   Generations   Edit Distance Operations   Solutions Offered   % Accuracy
3   30.9           311.2         2498.4                     3.6                 95.1
4   10.4           194.7         1068.8                     2.4                 78.2
5   3.9            88.5          471.7                      1.5                 54.0



The ranking procedure was tested on a similar set of data. Only the size of the test data
was increased; the proportions of error types and of edit distance values remained
essentially the same. The results of the performance of the ranking procedure are given

14 Although the PC-KIMMO based morphological analyzer and generator that we have used for this
study is rather slow, we have now ported our morphological analyzer system to the XEROX TWOL
system by Karttunen [7], and intend to integrate it into our system. This system can recognize and
generate Turkish forms in about a millisecond on Sun Sparcstations. With this system it will be possible 
to generate all solutions in about 1 to 2 seconds for t = 1 and in a few seconds for t = 2, on Sun 
Sparcstations. 



Table 2: Average number of operations per misspelled word, for t = 2

k   Recognitions   Generations   Edit Distance Operations   Solutions Offered   % Accuracy
3   108.4          4462.0        20680.4                    52.0                95.1
4   46.5           2247.8        10386.6                    35.5                78.2
5   13.6           817.1         3799.9                     20.3                54.0
Table 3: The performance of the ranking procedure

Edit Dist.   Given in First Pos.   Given    Not Given
1            75.8%                 20.7%    3.5%
2            28.2%                 51.2%    20.6%
3-4          5.5%                  25.0%    70.5%



in Table 3.

6 Conclusions 

This paper has presented a spelling correction algorithm for agglutinative languages that
is based on a two-level morphological generator and analyzer, and an intelligent generate
and test search procedure. The algorithm uses a q-gram based approach to determine
the candidate root words, and then from each root word generates, using a morphological
generator, valid forms in the language that are guaranteed not to deviate from the given
misspelled string by more than a threshold. We have applied this approach to Turkish,
and our results indicate that we can find the intended correct word in 95% of the cases 
and offer it as the first candidate in 74% of the cases, when the edit distance is 1. We feel 
that using k = 3 and t = 1, we get a satisfactory (functional) performance for Turkish. 
We can certainly improve on the ranking results by incorporating root usage statistics. 